Identification of Plasma Biomarkers from Rheumatoid Arthritis Patients Using an Optimized Sequential Window Acquisition of All THeoretical Mass Spectra (SWATH) Proteomics Workflow

Rheumatoid arthritis (RA) is a systemic autoimmune and inflammatory disease. Plasma biomarkers are critical for understanding disease mechanisms, treatment effects, and diagnosis. Mass spectrometry-based proteomics is a powerful tool for unbiased biomarker discovery. However, plasma proteomics is significantly hampered by signal interference from high-abundance proteins, low overall protein coverage, and high levels of missing data from data-dependent acquisition (DDA). To achieve quantitative proteomics analysis for plasma samples with a balance of throughput, performance, and cost, we developed a workflow incorporating plate-based high abundance protein depletion and sample preparation, comprehensive peptide spectral library building, and data-independent acquisition (DIA) SWATH mass spectrometry-based methodology. In this study, we analyzed plasma samples from both RA patients and healthy donors. The results showed that the new workflow performance exceeded that of the current state-of-the-art depletion-based plasma proteomic platforms in terms of both data quality and proteome coverage. Proteins from biological processes related to the activation of systemic inflammation, suppression of platelet function, and loss of muscle mass were enriched and differentially expressed in RA. Some plasma proteins, particularly acute-phase reactant proteins, showed great power to distinguish between RA patients and healthy donors. Moreover, protein isoforms in the plasma were also analyzed, providing even deeper proteome coverage. This workflow can serve as a basis for further application in discovering plasma biomarkers of other diseases.


Introduction
Rheumatoid arthritis (RA) is an inflammatory autoimmune disease that can lead to significant joint destruction (cartilage/bone erosion) and muscle wasting (cachexia) if left untreated.Blood-borne factors associated with inflammation and tissue breakdown are readily detectable in patients with severe RA [1].This has led to the development of valuable diagnostic tools, such as the C-reactive protein (CRP) test and 14-3-3 eta protein assay, which enable clinicians to make a more accurate diagnosis of RA [2].However, with the advent of several effective, though limited, remittive therapies, the demand for more sensitive biomarker tools to discriminate and predict patients' responses has greatly increased [3].This demand has propelled investigations into identifying novel bloodborne biomarkers using global proteomic profiling that analyzes samples in an untargeted broad-scale approach.Among the available plasma biomarker discovery approaches, including the Proximity Extension Assay (Olink), aptamer-based (SomaScan) [4,5], and mass spec-based proteomics, only the MS-based approach is reagent-independent and truly unbiased.
Identifying biomarkers from plasma using mass spec proteomics analysis is exceptionally challenging not only in biology but also from an analytical perspective.On one hand, the concentrations of plasma proteins can span an extremely wide range of over 10 orders of magnitude (from ~80 mg/mL to pg/mL), while on the other hand, a small number of high-abundance proteins consist of over 90% plasma protein content [6].Therefore, mass spec signal suppression from the high-abundance proteins can prevent the efficient detection of a large population of low-abundance proteins using mass spectrometry (MS)-based proteomics, which limits the proteome coverage of plasma compared to other matrixes [7].
To overcome these intrinsic issues of plasma proteomics, efforts have been made in this field on improving sample preparation: (a) depletion of high-abundance proteins [6]; (b) extensive fractionation to make low-abundance proteins detectable and utilizing advanced MS technologies, including [7]; (c) DIA (Data-Independent Acquisition) including SWATH (Sequential Window Acquisition of all THeoretical mass spectra) with fast scanning [8]; and (d) advanced PASEF (Parallel Accumulation-SErial Fragmentation technique) to further increase proteome depth and throughput, which is critical for making the workflow practical for biomarker discovery studies [9,10].One attractive approach for plasma proteomic analysis to highlight is to integrate nanoparticle (NP) protein coronas with LC-MS [11,12].The engineered NPs contain various physicochemical properties for nano-bio interactions, and each NP can interrogate hundreds of proteins in an unbiased manner to execute a highly parallel protein separation prior to MS.Using a 96-well automated workflow (Proteograph TM ), a panel of five NPs detected >2000 proteins across 141 plasma samples with DIA-MS (EKSPERT nano-LC425 -Triple TOF 6600+, Framingham, MA, USA) in a non-small cell lung cancer classification study; in contrast, only >100 proteins were detected from neat plasma without any depletion or fractionation [11].This platform leverages a rapid and deep profiling of the plasma proteome.However, it relies heavily on expensive NP products (Proteograph TM XT assay kit) and automation instruments (SP100 automation, Seer, Redwood City, CA, USA), which increases the budget burden and makes it impractical for large-scale studies, especially in startup laboratories.In addition, the required plasma sample volume could be up to 250 µL, instead of the traditional <10 µL, which may not be practical for sample procurement in animals (e.g., mice) and certain clinical studies.Another recently developed platform utilizes selective plasma protein precipitation by perchloric acid (perCA) [9], resulting in the detection of >1300 proteins per run and up to 60 SPD (samples per day), with an Evosep LC connected to a timsTOF Pro, enabling DIA-PASEF.This approach has shown the high-throughput ability to process >3000 samples without an obvious batch effect and reduces the cost to ~$2.5 per sample.However, this time and cost-effective platform still requires 50 µL of plasma and a higher-end LC/MS system, such as timsTOF Pro or equivalent, which limits the sample procurements and laboratories facilities.
Compared to the platforms mentioned above, conventional MS-based proteomics is more cost effective, requires less sample volume, and is more suitable for large-scale analysis.In the plasma proteomics workflow used in this study, we combined the construction of a comprehensive experiment-specific spectral library, plate-based sample preparation including high-abundance protein depletion, plate-based mixed-mode solid-phase extraction sample cleanup, and SWATH-DIA mass spec data acquisition, followed by spectral library data matching in DIA-NN.This workflow requires only 10 µL of plasma to achieve good proteome coverage.We further optimized each step to enforce a balance between proteome coverage and throughput.As it does not rely on high-end instruments or high volumes of plasma samples, this workflow could be a practical option for large-scale plasma proteomics, especially for plasma proteomics applications with limited sample volumes.
In this study, 80 plasma samples from RA patients (n = 60) and healthy donors (n = 20) were analyzed using the optimized SWATH proteomic workflow, resulting in 663 proteins being identified and quantified on average across subjects with low missing values (7.6% for all samples, 5.2% for healthy controls).Differentially expressed plasma proteins were identified using statistical analysis, and associated pathways were determined using overrepresentation analysis (ORA).The random forest algorithm was used to identify proteins that could discriminate between RA and healthy plasma samples.To further evaluate the performance of this workflow, we investigated the consistency between this study and other proteomic studies with not only plasma but also synovial tissue and synovial fluid samples from RA patients.

Human Plasma Samples
The subset of patients included in this study was randomly selected from patients enrolled in the SELECT-COMPARE study (NCT02629159) [13] who consented to exploratory research.All patients had active RA despite treatment with methotrexate for at least three months.Only baseline plasma samples (K2-EDTA blood collection) were selected for this experiment.EDTA plasma samples from healthy volunteers were obtained from Conversant Biologics, Inc., Huntsville, Alabama, USA.

Sample Preparation
To prepare the experiment-specific plasma peptide spectrum library, plasma from RA patients and healthy donors was pooled.Briefly, an equal volume (30 µL) of plasma from each of the study cohorts, 60 RA patient samples and 20 healthy controls, was pooled, resulting in a total of ~2.4 mL of pooled plasma at a 3:1 v/v ratio of RA to control ratio.The rationale is to have a representation of disease vs. healthy subjects based on the distribution of the study cohort.

Size Exclusion Chromatography (SEC) Fractionation
SEC chromatography was used as protein-level fractionation to reduce the complexity of plasma proteins before building the peptide spectral library.Pooled human plasma samples (1 mL) were filtered using a 0.2 µm membrane centrifuge device (CN# ODM02C34, PALL, Marlborough, MA, USA), and fractionated using a Superdex-200 10/300L SEC column (CN# 28990944, GE Health, Marlborough, MA, USA) on a Dionex Ultimate 3000 LC system with an Analytics SFM Sample and Fraction Manager.The flow rate was 0.9 mL/min with PBS buffer (pH7.4).Gel protein markers (CN#151-1901, Bio-Rad, Hercules, CA, USA) were used to determine protein molecular weight (MW).Peak-based fraction collection was detected using UV absorbance signals at 280 nm.Plasma samples were fractionated into four fractions with a recovery rate of approximately 70%.After the fraction collection, each fraction was concentrated using a 30 kDa molecular weight cut-off centrifugal filter (Amicon Ultra, 15 mL, CN UFC903096, MilliporeSigma, Burlington, MA, USA), and the buffer was exchanged into 25 mM Tris pH7.4 using the Amicon MWCO filter for downstream trypsin digestion and SAX fractionation.

Strong Anion Exchange (SAX) Fractionation
Strong anion exchange cartridges were used for peptide-level fractionation.Five aliquots of the plasma SEC fraction (~1 mL for each aliquot) were digested and used for peptide-level SAX fractionation.Briefly, 10 µL 0.5 M dithiothreitol (in 25 mM Tris, pH7.4) was added to the 1 mL of SEC fractionated plasma sample and incubated at 37 • C for 30 min to reduce the plasma proteins, followed by adding 40 µL 0.5 M iodoacetamide (IAM, Sigma-Aldrich, St. Louis, MO, USA, in 25 mM Tris, pH7.4) and incubated at room temperature in the dark for 30 min.Then, 70 µg of trypsin (Trypsin gold, MS grade, Promega, Madison, WI, USA) was added to digest the plasma proteins at a final concentration of approximately 1:50 trypsin-to-protein ratio (w/w).Strong anion exchange (SAX) fractionation was performed using a Pierce Strong Anion Exchange Spin Column (Thermo Fisher Scientific, Waltham, MA, USA) following the manufacturer's instructions.After conditioning the spin column with 20 mM Tris-HCl (pH8.0),tryptic peptide digests (3.5 mg) were loaded and eluted through the column under a centrifugal force of 500× g.Flow through and additional fractions were collected by washing stepwise with 20 mM Tris-HCl (pH8.0) and increasing NaCl concentrations.A total of 5 SAX fractions were collected with final concentrations of NaCl at 0.02 M, 0.1 M, 0.25 M, 0.5 M, and 1 M NaCl, respectively.

High-pH Reverse-Phase Fractionation
The SAX fractionated peptides were further fractionated using High-pH RPLC using an XBridge C18 4.6 × 150 mm analytical column (Waters, Milford, MA, USA) at a flow rate of 1 mL/min on an Agilent 1100 HPLC system (Agilent, Santa Clara, CA, USA)The mobile phases were 10 mM ammonium formate (pH9.0) in LC-MS-grade water (phase A) and 10 mM ammonium formate (pH9.0) in 90% acetonitrile (phase B).In total, 96 fractions were collected from each of the 5 SAX fractions through a 90 min gradient and then combined into 12 fractions following a fraction concatenation strategy [14,15].HPLC fractions were completely dried using SpeedVac (Thermo Fisher Scientific, Waltham, MA, USA) and reconstituted with 20 µL of HPLC-grade H 2 O, 2% acetonitrile, and 0.1% formic acid (FA) for LC/MS analysis.2.2.5.Top 14 High-Abundance Proteins Depleted for Peptide Spectral Library Building A parallel fractionation strategy was performed using depleted plasma.Briefly, 1 mL of pooled human plasma was depleted using the top 14 high-abundance protein depletion protocol based on the procedure of Section 2.3.1.The depleted plasma samples were reduced by Tris(2-carboxyethyl) phosphine (TCEP), alkylated by IAM, and digested by trypsin based on the procedure in Section 2.3.1.The resulting digested peptides were fractionated using the SAX and high-pH reverse-phase fractionation workflow, as described in Sections 2.2.3 and 2.2.4.

Liquid Chromatography and Mass Spectrometry Analysis Using the DDA Method
An experiment-specific peptide spectral library was constructed using a data-dependent (DDA) mass spectrometry method.The detailed procedure is based on a previous study [16] with minor revisions.Briefly, digested human plasma peptide samples from the nondepleted or depleted plasma fractionation strategies were analyzed using a capillary flow LC/MS system containing a Dionex UltiMate 3000 RSLC system (Thermo Fisher Scientific, Waltham, MA, USA) coupled online to a TripleTOF ® 6600 hybrid tandem mass spectrometer (SCIEX, Framingham, MA, USA) interfaced with the DuoSpray ESI source.Mobile phase A was 0.1% formic acid in 2% dimethyl sulfoxide (DMSO) in deionized water, and mobile phase B was 0.1% formic acid in 2% DMSO in acetonitrile.Then, 3-4 µg of digested peptide samples after high-pH reverse-phase fractionation was spiked with iRT peptide (Biognosys, Zurich, Switzerland) samples at a ratio of iRT/sample at 1/10 (v/v).Samples were injected into an autosampler and loaded onto a PepMAP100 C18 trap column (5 µm, 100 Å, 300 µm i.d.× 5 mm, Thermo Fisher Scientific, Waltham, MA, USA) and further gradient eluted and separated on a nanoEase m/z Peptide CSH C18 column (1.7 µm, 130 Å, 300 µm × 150 mm, Waters, Milford, MA, USA) with a flow rate of 3.1 µL/min.The HPLC gradient length was 180 min and the column temperature was 50 • C. MS1 was scanned from m/z 360-1500.The top 50 precursors were selected for fragmentation in EPI mode.The product ion mass spectrometry (MS) ranged from 100 to 1800 amu in the high sensitivity mode.Precursors with charges of 2 to 5 were selected for fragmentation, with an exclusion for 30 s after one occurrence.The accumulation times of DDA analysis were 0.25 s for precursor ions and 0.15 s for production ions.The precursor ions were fragmented using rolling collision energy.The total mass spec cycle time was 7.8 s.The TripleTOF instrument was calibrated after each sample in both MS1 and MS2 modes through LC/MS injection of beta-galactosidase trypsin-digested standards (Sciex, Framingham, MA, USA).

Peptide Spectral Library Generation
In total, 222 (165 non-depleted plasma + 57 depleted plasma) DDA files were selected from a total of 360 files to build the spectral library based on the richness of unique peptides and the spectral quality of the DDA file.DDA MS raw files were used to perform a protein database search using MaxQuant software (version 1.6.7.0, [17]) against the Uniprot-Swissprot human protein database downloaded on 12 January 2021.The search results were imported into Spectronaut version 12.0.20491.0.14754 (Biognosys) and combined into a spectral library.Search results with no unique protein groups were excluded from library generation, resulting in a final library consisting of 165 fractions from non-depleted plasma (SEC-based protein fractionation plus SAX and high-pH reverse-phase fractionation), and 57 fractions from depleted plasma (only SAX and high-pH reverse-phase fractionation).This final library, containing 21,670 precursors and 1352 protein groups, was used for library matching of SWATH-DIA data from the study samples.

In Silico Spectral Library Generation for Protein Isoform Analysis
To obtain a comprehensive spectral library for protein isoform analyses, we performed in silico digestion of human protein libraries from UniProt (SWISS-Prot protein isoform sequences downloaded on 12 January 2021) using DIA-NN version 1.8.1.

Sample Preparation
The high-abundance proteins in human plasma were depleted with Top 14 Abundant Protein Depletion Resin (Cat# A36372, Thermo Scientific, Waltham, MA, USA) in a 96-well plate to improve the throughput of plasma processing.Resin (400 µL ) was added to a 3M Empore 96-Well High-Performance Extraction Disk Filter Plate (#6065), which was fitted on top of a 1 mL deep-well plate.To reduce the sample-to-sample variant during sample preparation and control the batch effect, samples from RA patients and healthy controls were randomized before plate-based sample preparation.Plasma (10 µL ) was added to the wells, which were then incubated with depletion resins for 30 min.After centrifugation (3000 rpm, 4 • C, 5 min), 100 µL lysis buffer (2.5% sodium deoxycholate (SDC), 25 mM Tris(2-carboxyethyl) phosphine (TCEP) in 250 mM Tris pH8.5) was added to the flowthrough in the 1 mL-deep well plate at the bottom of the setup and incubated at 37 • C for 45 min.Free cysteines were alkylated with 10 mM iodoacetamide (IAM) for 30 min at room temperature in the dark.Then, 2.5 µg trypsin was added for protein digestion at 37 • C overnight.To quench the digestion, 50 µL of 10% formic acid was added to the digested samples.After centrifugation at 2000× g for 5 min, the supernatant was further cleaned using a strong cation exchange Oasis MCX 96-Well µElution Plate (Cat # 186001830BA, Waters, Milford, MA, USA).Briefly, the plate was equilibrated with methanol (MeOH) followed by 0.1% FA in deionized water, the samples were loaded, and the flow through was collected using a positive pressure plate manifold.The plate was washed with a solvent mixture of 32% acetonitrile, 32% MeOH, and 0.1% FA in water, followed by 5% acetonitrile/0.1% FA in water.Peptides were eluted from the plate cartridge by applying freshly prepared 2% NH 4 OH (pH11) in 55% acetonitrile twice and collected in a 1 mL-deep well plate.The eluted peptides were dried and reconstituted in 20 µL of 2% MeOH with 0.1% trifluoroacetic acid (TFA) for peptide concentration UV measurement (DropQuant, PerkinElmer, Waltham, MA, USA).The injection quantity for the SWATH-MS analysis was normalized to ~3.5 µg based on the peptide concentration measurement.

Liquid Chromatography and Mass Spectrometry Analysis for SWATH-DIA
The LC/MS instrument hardware setup was the same as that described in Section 2.2.6, with a modification to shorten the LC gradient length to enable higher throughput sample analysis.The resulting method could support a throughput of up to 16 samples/day.Briefly, a nanoEase m/z Peptide CSH C18 Column (130 Å, 1.7 µm, 300 µm × 150 mm, Waters, Milford, MA, USA) was used on a Dionex UltiMate 3000 RSLCnano System (Thermo Fisher Scientific, Waltham, MA, USA).Mobile phase A consisted of 0.1% formic acid and 2% DMSO in deionized water, and mobile phase B consisted of 0.1% formic acid and 2% DMSO in acetonitrile.A 73 min gradient method was used to separate the peptide samples from 1% to 7% of mobile phase B in the first 5 min at a flow rate of 3.1 µL/min, 7-32.5% of mobile phase B from 5 to 62 min at 3.1 µL/min, 32.5-57% B in 0.2 min at 3.1 µL/min, 57-95% B in 3.8 min at 4.0 µL/min.The column was washed with 95% B for 3 min and equilibrated with 1% B for 4 min before the next injection.For SWATH-MS acquisition, a 3.5 µg digested peptide, containing iRT retention time reference peptides (Biognosys) at an iRT/sample ratio of 1/10 (v/v), was injected and analyzed in the data-independent acquisition (DIA) mode using the SWATH method [16].MS1 scan was performed in the range of 360-1500 m/z in the positive ion mode.A SWATH method with 200 variable precursor isolation windows was used in this study.The MS2 window was determined by MS1 data points extracted from a DDA LC/MS run of a pooled human plasma digest sample and analyzed using the "SWATH Variable Window Calculator" Excel tool provided by SCIEX (Supplemental Material).The consecutive precursor isolation windows had a 1 amu overlap.The maximum accumulation times were 33 ms for MS1 scans and 34 ms for MS2 scans, resulting in an MS cycle time of 6.6 s.The rolling collision energy was used to determine the collision energy for each window.The mass spectrometer was calibrated using LC-MS injection of beta-galactosidase trypsin digest (SCIEX) after every two sample injections to maintain mass accuracy and resolution.To reduce run-to-run variability and potential batch effects, the injection sequence of samples from RA patients and healthy controls was randomized.Instrument QC samples (HeLa digest, 0.5 µg on column) were injected during the LC/MS batch to monitor the instrument performance.(Pierce HeLa Protein Digest Standard, 88328, Thermo Fisher Scientific, Waltham, MA, USA).

Protein Identification and Relative Quantification
Protein identification and quantification of SWATH data were performed using DIA-NN version 1.8.1 [18] with an experiment-specific spectral library from deep fractionation.Default DIA-NN settings were adopted, except match-between-runs (MBR) was enabled and protein inference was set to "Protein names (from FASTA)".Protein abundances were extracted from the report.pg_matrix.tsvfile from the output folder of the DIA-NN.Protein IDs with more than 50% missing values were removed from further analysis.Protein abundances were then log2-transformed and normalized based on sample medians using the "medianNormalization" function from the "NormalyzerDE" R package [19], and missing values were imputed using the "missForest" R package [20], which has been demonstrated as the most robust approach [21].
The same MS raw data files were analyzed with the in-silico library built based on the human isoform protein database in DIA-NN (v1.8.1) using "Isoform IDs" for protein inference, and data analysis followed the same procedure as the above-mentioned experiment-specific spectral library.An identified protein isoform should contain at least one unique peptide mapped to its unique region compared to its canonical sequence and other isoforms belong to the same gene.Therefore, a protein group containing only one protein ID was considered a protein isoform.Furthermore, non-canonical protein isoforms were determined if the corresponding UniProt accession ID contained a hyphen and a numerical suffix indicating the isoform number.The 25 identified non-canonical isoforms were manually confirmed in the UniProt database.

Data Analysis
To determine the differentially expressed proteins (DEPs) between RA patients and healthy donors, we used a linear model-based framework implemented in the "limma" R package [22].Proteins with nominal p-values less than 0.05 and a greater than 1.5-fold change in abundance were considered differentially expressed with statistical significance and dysregulated in RA.The enriched pathways in up-and down-regulated proteins were determined by over-representation analysis (ORA) using the "WebGestaltR" R package [23] against the non-redundant Gene Ontology (GO) terms of biological processes.A random forest model was implemented using the "caret" R package [24], and ROC analysis was conducted with the "pROC" R package [25].

Development of a SWATH Proteomics Workflow for Large-Scale Plasma Sample Analysis
The quality of the spectral library is a major factor that determines the sensitivity and depth of proteome coverage in SWATH DIA.Experiment-specific libraries generated locally have been reported to achieve better matching and quantitative performance than generic ones [26,27].The overall proteome coverage of study samples depends on the depth of the experimental peptide spectral library.In order to maximize the coverage of the spectral library, we designed a multidimensional fractionation strategy using pooled plasma samples for library building (Figure 1).This includes a strategy of size exclusion chromatography-based (SEC) protein-level fractionation, followed by strong anion exchange (SAX) cartridge-based peptide level fractionation and high pH reversephase peptide-level HPLC fractionation.We also employed a parallel strategy using top 14 high-abundance protein depletion at the protein level, followed by SAX and high-pH reverse-phase peptide-level fractionation.First, five plasma protein fractions were obtained from 1 mL of pooled plasma using size exclusion chromatography (SEC) fractionation.Each fraction was trypsin digested and subjected to stepwise strong anion exchange (SAX) column elution to obtain 5 peptide fractions, each of which was further separated by high-pH reverse-phase liquid chromatography (RPLC) and fractionated into 12 concatenated fractions, resulting in 300 fractions (5 SEC × 5 SAX × 12 high-pH) in total.A second library-building set was generated by removing potential interference from high-abundance plasma proteins in LC/MS analysis through depletion of the top 14 high-abundance proteins from 1 mL of the pooled plasma sample.The depleted plasma was digested and fractionated with SAX, followed by high-pH RPLC fractionation using the same procedure as the peptide-level fractionation strategy for non-depleted plasma samples, resulting in 60 fractions.A total of 360 fractions were individually analyzed by LC/MS in DDA mode.Leveraging a database search using MaxQuant, we examined the peptides identified in each fraction and removed fractions containing no unique peptides.The search results for 165 fractions from non-depleted plasma and 57 fractions from depleted plasma were merged using Spectronaut and combined to create a peptide spectral library.This library contained 1352 protein groups, which covered a broad spectrum of plasma proteins and was used in SWATH data analysis for protein identification and quantification through library matching (Figure 1).

SWATH Proteomic Analysis Identified Differentially Expressed Proteins and Associated Biological Pathways in RA
To identify protein biomarkers for RA, we performed proteomic analysis of plasma samples from 20 healthy donors and 60 RA patients using the optimized SWATH platform and an experiment-specific spectral library from deep fractionation.There are 663 proteins identified and quantified after removing proteins with excessive missing data across subjects.A principal component analysis (PCA) demonstrated that the major variances of this dataset came from intra-group sample variations, as indicated by the first principal component, and patient samples had larger variations than healthy samples (Figure 2A).The second principal component was discriminative between most RA and healthy samples, whereas the clusters were not separated explicitly (Figure 2A).A differential expression (DE) analysis was performed to identify differentially expressed proteins (DEPs) in plasma samples from RA patients.In total, 58 proteins with p-values less than 0.05 and greater than 1.5-fold changes were considered to have statistically significant differences between RA and healthy samples, including 15 proteins increasing and 43 proteins decreasing in abundance in RA (Figure 2B, Supplementary Data S1).One of the most commonly used inflammatory biomarkers, C-reactive protein (CRP), showed the highest increase in DEP in the RA plasma.Other active inflammation-related proteins, such as serum amyloid A proteins (SAA1 and SAA2) and calprotectin (S100A8 and S100A9), also showed remarkable increases in the RA plasma (Figure 2B).Interestingly, more proteins were found to decrease in abundance in the RA plasma (Figure 2B).

SWATH Proteomic Analysis Identified Differentially Expressed Proteins and Associated Biological Pathways in RA
To identify protein biomarkers for RA, we performed proteomic analysis of plasma samples from 20 healthy donors and 60 RA patients using the optimized SWATH platform and an experiment-specific spectral library from deep fractionation.There are 663 proteins identified and quantified after removing proteins with excessive missing data across subjects.A principal component analysis (PCA) demonstrated that the major variances of this dataset came from intra-group sample variations, as indicated by the first principal component, and patient samples had larger variations than healthy samples (Figure 2A).The second principal component was discriminative between most RA and healthy samples, whereas the clusters were not separated explicitly (Figure 2A).A differential expression (DE) analysis was performed to identify differentially expressed proteins (DEPs) in plasma samples from RA patients.In total, 58 proteins with p-values less than 0.05 and greater than 1.5-fold changes were considered to have statistically significant differences between RA and healthy samples, including 15 proteins increasing and 43 proteins decreasing in abundance in RA (Figure 2B, Supplementary Data S1).One of the most commonly used inflammatory biomarkers, C-reactive protein (CRP), showed the highest increase in DEP in the RA plasma.Other active inflammation-related proteins, such as serum amyloid A proteins (SAA1 and SAA2) and calprotectin (S100A8 and S100A9), also showed remarkable increases in the RA plasma (Figure 2B).Interestingly, more proteins were found to decrease in abundance in the RA plasma (Figure 2B).To further investigate the biological mechanisms related to the DEPs in RA plasma, we analyzed the enriched biological processes in terms of Gene Ontology (GO) using an over-representation analysis (ORA).The plasma DEPs increasing in abundance in RA primarily represented signals from active inflammation and immune response, such as acute inflammatory response and humoral immune response (Figure 3A, Supplementary Data S2).Particularly, neutrophil-mediated immunity was enriched in these DEPs.A diagnostic tool that has recently been investigated for RA evaluates the neutrophil-tolymphocyte cell count in the blood of patients.Several studies using this diagnostic tool have demonstrated that the ratio is much higher in patients with RA as compared to healthy individuals [28,29].Upon assessment of those DEPs up-regulated in RA samples from our studies, we found that many are known to be produced by neutrophils (S100A9, S100A8, RAB7A, DEFA1, HP, LRG1, FGB, CTSG, MPO, LTF, PGLYRP1, SAA2, ORM1, ORM2, MMP9, SERPINA1, TNC, TIMP1, and PSMA6), and several have been classified as neutrophil activation markers, including calprotectin (S100A8 and S100A9 [30] HP [31], LRG1 [32], CTSG [33], LTF [34], PGLYRP1 [35], MPO [36], and MMP9 [37]).Investigations into the direct correlations between these neutrophil activation markers in blood and the circulating frequency of neutrophils are needed to corroborate these findings.On the other hand, biological processes related to platelet functions, coagulation, and the muscle system were found to be enriched in DEPs, decreasing in abundance in RA plasma samples (Figure 3B, Supplementary Data S3).Platelets have important immune effector functions in RA, and related signaling pathways are dysregulated in the presence of pro-inflammatory molecules, such as collagen, thrombin, fibrinogen, and cytokines [38].Down-regulation of the muscle system could be related to the loss of muscle mass, which was commonly observed in the joints of RA patients [39].To further investigate the biological mechanisms related to the DEPs in RA plasm we analyzed the enriched biological processes in terms of Gene Ontology (GO) using hand, biological processes related to platelet functions, coagulation, and the muscle system were found to be enriched in DEPs, decreasing in abundance in RA plasma samples (Figure 3B, Supplementary Data S3).Platelets have important immune effector functions in RA, and related signaling pathways are dysregulated in the presence of pro-inflammatory molecules, such as collagen, thrombin, fibrinogen, and cytokines [38].Down-regulation of the muscle system could be related to the loss of muscle mass, which was commonly observed in the joints of RA patients [39].Protein isoforms in the plasma have been reported as potential biomarkers for certain diseases [40,41], but such studies are rarely found in RA.Therefore, we examined the same raw DIA MS files using an in-silico-digested human protein database containing isoform sequences and focused on protein isoform detection.Twenty-five non-canonical protein isoforms were identified, with at least one unique peptide differentiating them from other isoforms.Specifically, the protein isoforms from FN1 (P02751-11), TPM3 (P06753-2), TPM1 (P09493-5), and NME2 (P22392-2) displayed a significant decrease in abundance in RA (p < 0.05) and a > 1.5-fold decrease (Supplementary Data S4).A previous study demonstrated that the splicing machinery is impaired in RA leukocytes from peripheral blood and synovial fluid [42]; thus, the alteration of protein isoform abundance may reflect the Protein isoforms in the plasma have been reported as potential biomarkers for certain diseases [40,41], but such studies are rarely found in RA.Therefore, we examined the same raw DIA MS files using an in-silico-digested human protein database containing isoform sequences and focused on protein isoform detection.Twenty-five non-canonical protein isoforms were identified, with at least one unique peptide differentiating them from other isoforms.Specifically, the protein isoforms from FN1 (P02751-11), TPM3 (P06753-2), TPM1 (P09493-5), and NME2 (P22392-2) displayed a significant decrease in abundance in RA (p < 0.05) and a >1.5-fold decrease (Supplementary Data S4).A previous study demonstrated that the splicing machinery is impaired in RA leukocytes from peripheral blood and synovial fluid [42]; thus, the alteration of protein isoform abundance may reflect the dysfunction of alternative splicing.Interestingly, anti-TNF treatment could restore the function of the splicing machinery [42], suggesting that the protein isoforms could be good biomarker candidates to indicate the efficacy of anti-TNF treatment.Nonetheless, further investigations are needed to elucidate the biological relevance of these differentially expressed plasma protein isoforms in RA pathogenesis.

Meta-Analysis to Compare This Study with Other RA Omics Studies
To evaluate the findings from this study and the potential of differentially expressed proteins as RA biomarkers, a literature survey was performed using MS-based proteomic studies of serum/plasma, synovial fluid, and synovial tissue from RA patients.RNA sequencing data from the synovial tissues were also included.High-confidence biomarkers are preferred to have an alignment between systemic circulation and the disease site of action, instead of from circulation alone.The inclusion of synovial tissue and fluid proteomics could infer whether the differential expression of plasma proteins reflects the changes in RA patients' disease tissue.Only reports within the last decade were considered, and two additional criteria were applied to filter the studies: first, comparisons between RA and control were available; second, the full lists of differentially expressed proteins were accessible, as re-analyzing previous works using raw data was not the primary intention of this work.There were limited numbers of proteomic studies of synovial tissue or fluid, including RA and healthy samples; thus, it was also acceptable if osteoarthritis (OA) was considered as the control in the selected studies.OA is a degenerative disease of local joints with a much lower inflammation level than RA and is usually used as a control in RA studies [43].In total, we included nine reference proteomics datasets: four from serum/plasma [44][45][46][47], two from synovial tissue [48,49], and three from synovial fluid [50][51][52].We also included another two transcriptomics datasets from synovial tissue [53,54].Table 1 summarizes the number of reference datasets with the same common DEP increase in abundance in our dataset.The results showed that DEPs with the most significant increase in abundance, such as CRP, S100A9, S100A8, SAA1, and SAA2, were consistent with previous reports.Specifically, FGL1 was proposed as a novel and specific biomarker that could be clinically useful for predicting the progression of RA [46], which is also present in the DEP list from this study.In multi-omics studies using matching tissues, the mRNA level and protein expression normally have moderate correlations, with even lower correlations for the circulating protein expression level.Not surprisingly, there was a limited overlap in DEPs between this plasma proteomic study and previous synovial tissue transcriptomic datasets, indicating the distal nature of circulating protein biomarkers in the plasma of RA patients.Nonetheless, activated neutrophils have been detected in high numbers in the synovial joints and tissues of RA patients [55,56], and upon comparison with the transcriptomic profile of RA synovial fluid neutrophils [57], several of the proteins identified in the plasma of RA donors as potentially attributable to activated neutrophils were found to be expressed in this pro-inflammatory neutrophil population, including MMP9, ORM1, ORM2, S100A8, S100A9, and TKT (Supplementary Data S1).
Number in the paratheses is the number of selected papers in the corresponding category, and numbers in the table indicate the number of studies that show consistent results compared with the present study.* Comparisons between synovial tissue from RA and OA patients.# Comparisons between synovial fluid from RA and OA patients or RA and SpA patients.

Biomarker Identification Using Random Forest to Discriminate between RA and Healthy Plasma
In recent years, machine learning algorithms have been widely used in biomarker discovery for sample classification and feature selection.In this study, 70% of the samples were randomly selected to train a random forest (RF) model with a 5-fold cross-validation to distinguish RA from healthy samples.The model was validated by predicting the remaining 30% of the samples and achieved a classification accuracy of 82%.The proteins that accounted for the highest importance of model accuracy were CRP, SAA2, S100A9, IGLV1-47, APOA2, S100A8, TNC, F12, APOA1, and FHX8, which could be considered potential biomarkers (Figure 4A).The area under the curve (AUC) was also calculated for the receiver operating characteristic (ROC) analysis to evaluate the discrimination performance of these proteins.Six proteins achieved AUC greater than 0.8, including CRP, SAA2, S100A9, APOA2, S100A8, and F12 (Figure 4B), indicating that these proteins could have great power to discriminate between RA and healthy samples based on plasma protein expression.However, these proteins have been reported to be associated with inflammation and used as biomarkers for other inflammatory diseases such as inflammatory bowel disease (IBD) [58].Therefore, these proteins may not be specific enough to the disease, which limits their application as RA-specific plasma biomarkers.

Biomarker Identification Using Random Forest to Discriminate between RA and Healthy Plasma
In recent years, machine learning algorithms have been widely used in biomarker discovery for sample classification and feature selection.In this study, 70% of the samples were randomly selected to train a random forest (RF) model with a 5-fold cross-validation to distinguish RA from healthy samples.The model was validated by predicting the remaining 30% of the samples and achieved a classification accuracy of 82%.The proteins that accounted for the highest importance of model accuracy were CRP, SAA2, S100A9, IGLV1-47, APOA2, S100A8, TNC, F12, APOA1, and FHX8, which could be considered potential biomarkers (Figure 4A).The area under the curve (AUC) was also calculated for the receiver operating characteristic (ROC) analysis to evaluate the discrimination performance of these proteins.Six proteins achieved AUC greater than 0.8, including CRP, SAA2, S100A9, APOA2, S100A8, and F12 (Figure 4B), indicating that these proteins could have great power to discriminate between RA and healthy samples based on plasma protein expression.However, these proteins have been reported to be associated with inflammation and used as biomarkers for other inflammatory diseases such as inflammatory bowel disease (IBD) [58].Therefore, these proteins may not be specific enough to the disease, which limits their application as RA-specific plasma biomarkers.

Discussion
In this study, we analyzed 80 plasma samples from RA patients and healthy donors using an optimized SWATH DIA proteomics strategy.The comprehensive experimentspecific spectral library from deeply fractionated pooled samples, plate-based sample preparation procedure, and robust LC-MS system enabled high-throughput deep proteome coverage.Among the 663 proteins identified and quantified, 58 were found differentially expressed in RA with statistical significance.The DEPs that were increased in RA were mostly linked to active inflammation and immune response, and DEPs with decreased abundance in RA presented platelet and coagulation dysfunction and a reduction in muscle mass.Our results, particularly the increase in DEPs in RA, showed explicit agreement with other recent plasma proteomics studies.Many of the DEPs that increased in abundance in RA plasma were also found to be more abundant in RA synovial fluid and synovial tissue, indicating that these plasma proteins could reflect the disease-related alterations in protein expression in the joints of RA patients.A random forest model trained with this dataset was able to discriminate RA samples from healthy samples with an accuracy of 83%, and the proteins with the highest importance to the model accuracy also showed great discriminative power according to the ROC analysis.This study provides a feasible option for implementing high-throughput SWATH-DIA proteomics technology to analyze plasma samples and generate high-quality quantitative data with good proteome coverage.
In this SWATH-DIA workflow, the proteome coverage was considerably higher than that of traditional DDA analysis.Compared with the general average of 200-500 plasma proteins identified [45,47,59,60], 663 proteins were identified in this study, indicating a remarkable increase in the depth of proteome coverage.Several steps were taken to improve proteomics analysis sensitivity and data quality: 1. DMSO as an additive to the HPLC mobile phase [16,61], 2. Capillary flow HPLC method [62].The flow rate range of 3-5 µL/min is a sweet spot where the system can still fit a regular ESI source without a true nano ESI source, while the system can still benefit from sensitivity improvement at a lower flow rate.In contrast, at a high flow rate of 20-50 µL/min, the sensitivity approaches a regular analytical flow LC/MS without the benefit of low flow rate sensitivity improvement; 3. Variable window SWATH method.Because the peptide abundance across the m/z range is not evenly distributed for trypsin-digested samples, an evenly distributed SWATH window will result in overcrowding of peptides in certain ranges, leading to complex chimeric spectra and lower library match data quality, while under-sampling in other m/z ranges with fewer abundant peptides.To increase the specificity of the DIA MS2 analysis and avoid the uneven distribution of MS2 resources, a variable SWATH window strategy was used.Based on TripleTOF vendor specifications, 200 is the maximum allowable window size.However, to accommodate all 200 windows without exceeding the allowable MS cycle time, careful balancing of the window size, accumulation time, and overall MS cycle time is needed.Attempts to increase the flow rate, sharpen the peak shape, and maximize the number of data points per peak were made.The final method has a total MS cycle time of 6.7 s which can support 7-8 data points per HPLC peak as a balance of sensitivity, proteomics coverage, robustness, and quantitative performance without sacrificing system stability and robustness.
Plasma is a complex biological matrix.There are multiple challenges in large-scale plasma proteomics sample preparation: 1. high-abundance plasma proteins and large dynamic range of plasma proteins concentrations (>12 orders of magnitude) [6]; 2. high concentrations of lipids (~5 mg/mL) [63]; 3. sample cleanup is critical after reduction and alkylation before LC/MS analysis.To achieve reproducible and robust plasma proteomics analysis, we adopted a plate-based workflow to address these challenges, which include: 1. plate-based top 14 high-abundance protein depletion; 2. mixed mode solid-phaseextraction plate-based desalting and sample clean-up.In the field of mass spec proteomics, most of the studies were conducted using nanoflow HPLC to achieve maximum sensitivity.However, there are multiple disadvantages to using nano-LC platforms, such as long LC gradients, slower sample loading time, lower robustness, shorter column life, etc.The capillary HPLC system was exceptionally suitable for the Sciex TripleTOF system [16].The column life can exceed 500 injections (internal data), which is highly desirable for complex sample types such as human plasma with large cohort sizes.
More importantly, the run-to-run consistency of the current platform is worth considering.The proteome coverage of a cohort study typically consists of three metrics: 1. the overall proteome coverage from all subjects in a study, which is the highest protein ID due to the stochastic nature of the mass spec proteomics analysis; 2. the average protein ID for each subject: 3. the overlapping protein ID across all subjects in the study, which is the lowest because of missing values across the cohort.In a quantitative proteomics-based biomarker discovery, the most valuable metric is the overlapping protein ID, which indicates consistent detection and quantitation across samples and groups.Many plasma proteomics studies have focused on the total coverage from the entire study but overlooked the coverage variability across samples [44], with a dramatic difference observed between total protein ID and overlapping protein ID.Therefore, the total protein ID is often mis-leading.The current platform not only generates good average proteome coverage but also contains exceptionally low missing values, making biomarker discovery more reliable.
For small-quantity proteomics applications such as single-cell or laser capture microdissection, the SWATH-DIA platform is not suitable; however, for applications where sample quality is not limited, such as plasma sample analysis, this platform is appropriate with good reproducibility and low cost.In comparison to other DIA platforms (timsTOF and Orbitrap), the current capillary flow TripleTOF SWATH-DIA platform has similar proteome coverage for plasma proteomics analysis with a similar starting volume of plasma samples.However, the peptide mass spec injection quantity requirement (3-4 µg) is approximately 5× higher than orbitrap (0.5-1 µg) and 10× higher than that of timsTOF (0.3 µg) mainly due to the lack of true nanoflow and the lack of trapping function of the TripleTOF instrument.While the overall yield of plasma proteomics sample preparation is more than sufficient to support the 4 µg injections, a higher injection quantity may lead to faster instrument contamination and performance deterioration, which should be considered for large-scale cohort analysis.To this end, the entire workflow can be easily transferred (including sample preparation and data analysis) when the mass spectrometry instrument itself is upgraded to more sensitive options.
The comprehensiveness of the spectral library is a major factor for protein detection in DIA-based proteomics.In this study, the project-specific spectral library derived from 222 fractions presented a deep coverage of plasma proteins and contributed to the identification of 663 proteins.The generation of a spectral library from deep fractionations can be time consuming, cost effective, and limited by the availability of extra samples from the same study.Nonetheless, it is beneficial for future studies with the same sample type as the spectral library.On the other hand, generating spectral libraries from in silico digestion of the FASTA protein databases has recently become increasingly popular and demonstrated equivalent performance to the DDA libraries [18,64].We also evaluated the in silico spectral library generated by DIA-NN and it achieved overall similar proteome coverage as the DDA library, suggesting that the in-silico library could be a viable alternative if the experiment-specific library is unfeasible.However, the number of proteins identified in the in-silico library varied significantly between individual samples.Interestingly, fewer proteins were identified using the merged DDA and in silico libraries; thus, merging the DDA and in silico libraries is not pursued for the current study.Based on this learning, the effort of protein isoform analysis was entirely based on in silico libraries without rebuilding peptide spectral libraries based on DDA runs.Available alternative splicing isoform analyses using mass spectrometry proteomics from human plasma are limited.The fact that the current platform can analyze splice variants exemplifies its unique capabilities.
The latest build of the Human Plasma Proteome Project has 4395 canonical proteins [65], with an average of ~450 proteins from each experiment and even lower protein coverage per sample.The current paper presents the advancement of shotgun plasma proteomics (bottom-up approach) by using a comprehensive peptide library and DIA approach.However, the shotgun approach still only covers a narrow range of protein sequences (1-2 peptides) for the low-abundance proteins.Although only one unique peptide per protein was sufficient for protein inference from peptide-spectral matching (PSM), the detected peptides did not necessarily contain unique isoform sequences.On the other hand, the possible number of proteoforms in human plasma is much higher [66,67].
Proteoforms are the different forms of a protein with a variety of sequence variations, splice isoforms, and post-translational modifications [68,69].One of the most important sources of proteoform complexity is alternative splicing.Some genes can produce multiple isoforms of the same protein through alternative splicing or transcription initiation, which might have distinct functions and properties.Differential protein expressions related to disease may not be manifested at the overall protein expression level but at the splice variant isoform level [70].Although trypsin digestion-based shotgun proteomics provides high coverage of canonical protein sequences, the detection and quantitation of splice variants using shotgun proteomics is limited as the unique sequences of splice variants are often masked by the trypsin digestion site [71].Protein post-translational modification is another source of proteoform complexity in disease states [72].The most common PTM in plasma proteins is glycosylation [73], while the measurement of glycopeptides using the shotgun proteomics method is a specialty by itself, with dedicated sample preparation, chromatography, and mass spec fragmentation mechanisms [73,74].
An intact analysis of plasma proteins is more appropriate to investigate disease-related proteoforms.However, the analysis and quantification of proteoforms at the intact level is highly challenging.Among the different analytical options, highly sensitive reagent-based assays, such as SomaScan or Olink, require the development of proteoform-specific reagents (aptamers or antibodies) for each proteoform.While mass spectrometry is highly capable of analyzing proteoforms and there has been tremendous progress in the field of top-down proteomics [75], the coverage of top-down proteomics analysis is still too low to be used as the primary biomarker discovery tool [73].Most of the advances are using top-down mass spectrometry to study low-molecule-weight proteoform species [76] or based on sample fractionation [77] or immunoaffinity sample enrichment of the targeted proteoform analysis [78].Another limitation of intact proteoform analysis is for glycoproteins, as the complexity of glycoforms could be overwhelming [79].
One possible strategy could be using shotgun proteomics results to nominate proteoform biomarker candidates, for example, using shotgun proteomics to identify splice variants [80].The shortlist of isoform biomarker candidates could be validated by using antibody-based immunoaffinity enrichment, followed by targeted intact protein/top-down MS analysis [78].The glycoprotein proteoform can be analyzed using shotgun glycoproteomics [73] or intact protein analysis for purified proteins [81].To achieve higher splice variance in protein isoform detectability, the proteomics platform needs to have good overall protein sequence coverage and peptide level coverage.The current study provides an example of improved peptide-level proteome coverage.One explanation is that the current platform is based on an in-depth peptide-spectral library building, which helps provide a large splice isoform-specific peptide repertoire.There are still many limitations to this approach, but it provides a starting point for the comprehensive proteoform biomarker discovery process.
In addition to the technical improvements in workflow and detection sensitivity, the current study also highlights that these modifications can be implemented without sacrificing the ability to detect biologically meaningful differences in the proteomic profile of plasma from RA patients compared to healthy donors.A differential expression analysis and pathway over-representative analysis revealed that the change in protein expression profiles in RA plasma primarily reflected active inflammation and immune response, as well as the repression of platelet functions (Figures 2 and 3).Some of the DEPs that increased in abundance in the plasma may indicate enhanced expression of the same proteins in synovial tissues and fluids of RA patients (Table 1).Particularly, the expression of calprotectin (S100A8 and S100A9) showed the capability to discriminate between RA and healthy plasma samples (Figure 4).However, most of the upregulated proteins detected in this study were typical inflammatory markers that are not specific to RA, thus undermining their value as diagnostic biomarkers for RA.For example, inflammatory bowel disease (IBD), which includes Crohn's disease and ulcerative colitis, is a chronic inflammatory disorder of the gastrointestinal tract.Studies have reported that C-reactive protein, serum amyloid A, and calprotectin all show diagnostic value for IBD with high sensitivities and specificities [82][83][84][85].Therefore, non-inflammatory RA-specific biomarkers require further investigation.A comparison of protein expression profiles between RA and other inflammatory disorders, such as asthma and IBD, could potentially lead to more specific biomarker discovery for RA [58].Despite this, these proteins can potentially be used as generic indicators of treatment effectiveness for these inflammatory diseases.
Besides the canonical proteins, this study identified 25 non-canonical protein isoforms from human plasma samples and 4 of these, FN1 (P02751-11), TPM3 (P06753-2), TPM1 (P09493-5), and NME2 (P22392-2), showed greater than 1.5-fold change decreases in RA with statistical significance (Supplementary Data S4).Fibronectin (FN1) has 15 isoforms from alternative splicing, and different isoforms have been reported to be present in synovial fluid from RA patients [86].Interestingly, fibronectin has been found to promote differentiation and mineralization of osteoblasts [87], thus the decrease in fibronectin isoforms in RA plasma may indicate dysfunction of osteogenesis.Further investigation of the isoforms dysregulated in RA could reveal more biological mechanisms.
In addition to LC/MS-based proteomics, capture reagent-based proteomics platforms, such as OLINK and SomaScan, are popular options for analyzing plasma proteins for biomarker discovery.These platforms provide a wide range of options for quantifying hundreds to thousands of plasma proteins from a single sample and cover broad dynamic ranges to enable the detection of low-abundant proteins [4,5,88].Compared to these platforms, LC/MS proteomics is not limited by the availability of detection reagents but lacks the sensitivity to identify low-abundance proteins.Therefore, LC-MS and capture reagentbased proteomics platforms are complementary.Mass spec-based plasma proteomics has witnessed tremendous progress in recent years through the development of deep fractionation and efficient depletion.However, nanoparticle-based protein enrichment has been the most impactful development based on nanoparticle-based protein enrichment [12].Based on a panel of five nanoparticle types, the Seer platform can routinely analyze >1500 protein groups in human plasma using 250 µL of samples.Continuous improvements in sensitivity and throughput in mass spectrometry will enable routine analysis of human plasma at >2000 protein groups [81,89].However, considering the cost and volume requirements of the Proteograph platform and other equivalent platforms, the approach presented here still has good value and suitability for plasma biomarker discovery in certain applications, especially for mouse plasma biomarker discovery applications where the plasma volume is limited and the suitability of the nanoparticle panel for mouse protein is still untested.
Although the platform presented here was developed for human plasma, it can be readily deployed for other matrixes for biomarker applications.Matrixes other than plasma are actually less challenging and the sample preparation procedure can be further simplified.An attractive biomarker discovery and validation approach could be using the high-throughput proteomics platform presented in this study to quickly generate biomarker "hits" using secreted matrixes, such as plasma, CSF, and feces, together with tissue of "siteof-action" and confirmed using targeted protein quantitation assays, for example, LC/MS or ligand binding assays in the secreted matrices through a more comprehensive study design.The benefit of this approach is its low cost, independence of reagent availability, and broad applicability in various matrices.

Conclusions
In summary, we analyzed plasma samples from RA patients and healthy donors using an improved SWATH-DIA proteomics workflow.A good proteome coverage of 663 proteins was achieved with low missing values.The change in protein expression in RA plasma mostly reflected active inflammation and immune response, as well as inhibition of platelet activities and loss of muscle mass.Several differently expressed proteins were found to be consistent with previous reports.Compared to the workflows that require high costs either on assays (NPs) or LC-MS instruments (Evosep LC, timsTOF), the current workflow is a viable option for plasma proteomics with satisfactory throughput and proteome coverage to enable biomarker discovery applications with cost effectiveness, and it can be easily expanded to plasma proteomic studies for other species and disease indications without sample volume restrictions.

Supplementary Materials:
The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/proteomes11040032/s1.A Supplementary Data file is available with the following information.Supplementary Data S1: Result of differential expression analysis of all proteins comparing RA vs. Healthy plasma samples.Supplementary Data S2: List of over-representative Gene Ontology Biological Processes among DEPs increasing in abundance in RA (p-value < 0.05).Supplementary Data S3: List of over-representative Gene Ontology Biological

Proteomes 2023 , 22 Figure 1 .
Figure 1.Schematic of the optimized SWATH proteomics workflow.(A) DDA spectral library construction procedure from pooled plasma samples.(B) plasma sample preparation and DIA proteomics procedure using DDA spectral library from (A).

Figure 1 .
Figure 1.Schematic of the optimized SWATH proteomics workflow.(A) DDA spectral library construction procedure from pooled plasma samples.(B) plasma sample preparation and DIA proteomics procedure using DDA spectral library from (A).

Figure 2 .
Figure 2. Proteomics analysis in RA and healthy groups.(A) principal component analysis (PCA plasma protein expression pattern from RA and healthy samples.(B) volcano plot highlight DEPs comparing RA against healthy samples.The horizontal red dashed line indicates p-valu 0.05, and the vertical red dashed line indicates a 1.5-fold change in protein abundance.

Figure 2 .
Figure 2. Proteomics analysis in RA and healthy groups.(A) principal component analysis (PCA) of plasma protein expression pattern from RA and healthy samples.(B) volcano plot highlighting DEPs comparing RA against healthy samples.The horizontal red dashed line indicates p-value = 0.05, and the vertical red dashed line indicates a 1.5-fold change in protein abundance.

Figure 3 .
Figure 3. Enriched GO biological processes ORA in DEPs.(A) top 15 GO biological processes by in DEPs increasing in abundance ranked by false discovery rate (FDR).(B) top 15 GO biological processes in DEPs decreasing in abundance ranked by FDR.

Figure
Figure Enriched GO biological processes ORA in DEPs.(A) top 15 GO biological processes by in DEPs increasing in abundance ranked by false discovery rate (FDR).(B) top 15 GO biological processes in DEPs decreasing in abundance ranked by FDR.

Figure 4 .
Figure 4. Biomarker discovery by random forest.(A) top 10 proteins with highest feature importance in determining model accuracy.(B) ROC plots of proteins with AUC greater than 0.8 in distinguishing RA from healthy samples.

Figure 4 .
Figure 4. Biomarker discovery by random forest.(A) top 10 proteins with highest feature importance in determining model accuracy.(B) ROC plots of proteins with AUC greater than 0.8 in distinguishing RA from healthy samples.

Table 1 .
List of common DEPs increasing in abundance between the current proteomics dataset and other previous proteomics and transcriptomics datasets.

Symbol Serum/Plasma Proteomics (4) Synovial Tissue Proteomics * (2) Synovial Fluid Proteomics # (3) Synovial Tissue Transcriptomics (2)
Number in the paratheses is the number of selected papers in the corresponding category, and numbers in the table indicate the number of studies that show consistent results compared with the present study.* Comparisons between synovial tissue from RA and OA patients.# Comparisons between synovial fluid from RA and OA patients or RA and SpA patients.